Computational assessment of groundwater salinity distribution within coastal multi-aquifers of Bangladesh

The rising salinity trend in the country’s coastal groundwater has reached an alarming rate due to unplanned use of groundwater in agriculture and seawater seeping into the underground due to sea-level rise caused by global warming. Therefore, assessing salinity is crucial for the status of safe groundwater in coastal aquifers. In this research, a rigorous hybrid neurocomputing approach comprised of an Adaptive Neuro-Fuzzy Inference System (ANFIS) hybridized with a new meta-heuristic optimization algorithm, namely Aquila optimization (AO) and the Boruta-Random forest feature selection (FS) was developed for estimating the salinity of multi-aquifers in coastal regions of Bangladesh. In this regard, 539 data samples, including ten water quality indices, were collected to provide the predictive model. Moreover, the individual ANFIS, Slime Mould Algorithm (SMA), and Ant Colony Optimization for Continuous Domains (ACOR) coupled with ANFIS (i.e., ANFIS-SMA and ANFIS-ACOR) and LASSO regression (Lasso-Reg) schemes were examined to compare with the primary model. Several goodness-of-fit indices, such as correlation coefficient (R), the root mean squared error (RMSE), and Kling-Gupta efficiency (KGE) were used to validate the robustness of the predictive models. Here, the Boruta-Random Forest (B-RF), as a new robust tree-based FS, was adopted to identify the most significant candidate inputs and effective input combinations to reduce the computational cost and time of the modeling. The outcomes of four selected input combinations ascertained that the ANFIS-OA regarding the best accuracy in terms of (R = 0.9450, RMSE = 1.1253 ppm, and KGE = 0.9146) outperformed the ANFIS-SMA (R = 0.9406, RMSE = 1.1534 ppm, and KGE = 0.8793), ANFIS-ACOR (R = 0.9402, RMSE = 1.1388 ppm, and KGE = 0.8653), Lasso-Reg (R = 0.9358), and ANFIS (R = 0.9306) models. Besides, the first candidate input combination (C1) by three inputs, including Cl− (mg/l), Mg2+ (mg/l), Na+ (mg/l), yielded the best accuracy among all alternatives, implying the role importance of (B-RF) feature selection. Finally, the spatial salinity distribution assessment in the study area ascertained the high predictability potential of the ANFIS-OA hybrid with B-RF feature selection compared to other paradigms. The most important novelty of this research is using a robust framework comprised of the non-linear data filtering technique and a new hybrid neuro-computing approach, which can be considered as a reliable tool to assess water salinity in coastal aquifers.

people globally depend on groundwater for water supply and to keep many key terrestrial ecosystems alive 3 . In many parts of the world, excess water exploration due to increasing water demand has threatened their long-term viability 4,5 . The encroachment of seawater into coastal aquifers and the removal of water from coastal aquifers have caused erratic changes in the quality of groundwater and water flow patterns.
Recently, the growing demand for groundwater resources has subjected such natural resources to more strain than ever before 6,7 . In unconfined coastal aquifers, the most typical problem is groundwater salinization, mainly when excessive groundwater pumping reduces the piezometric head 8 . Confining layers that partly separate seawater from groundwater and the vadose zone has complicated the hydrogeological processes in confined and semi-confined coastal aquifers 9 . As a result, seawater can easily intrude into semi-confined coastal aquifers. The problem of groundwater depression has been documented in various cases worldwide [10][11][12] . The increasing level of salt accumulation in both plants and soil has significantly increased groundwater salinity, which has negatively impacted ecological health, the economic advancement of residents, and the productivity of coastal crops. Increased salinity also hurts drinking water quality, thereby jeopardizing human health 13,14 . Furthermore, groundwater salinization results in an increased quantity of salt in the root zone, which creates an osmotic impact on plants, forcing them to expend more energy to absorb water from the soil, thereby limiting their ability to develop 15 .
Another issue is that excessive groundwater extraction and poor management have established local or regional groundwater depressions in several regions 16 . Prolonged over-extraction of groundwater can result in depression, as seen in the North China Plain, where several towns, including Cangzhou, Dezhou, and Tianjin, were in severe water depression 17 . Excessive groundwater extraction from aquifers in Dhaka, Bangladesh, has also been reported to cause an exponential fall in groundwater levels around the city and water qualityrelated issues 18 . Several wells were installed in Tripoli, Northwest Libya, to pump groundwater in the city; this act caused a sharp decline in groundwater level; a further construction of a groundwater depression caused the limited discharge of freshwater to the ocean 19 . Groundwater formation may also result in geological risks, such as ground cracks and land subsidence. In China, excessive groundwater extraction in some representative locations has been reported to cause geological problems due to prolonged over-extraction consistently distributed with groundwater depression 20 .
Based on the reported literature, the salinity in groundwater is stochastic, and this is because several parameters affect its concentration and magnitude. Those parameters include upward instruction from deep aquifers 21 , evaporation rate from soil 22 , irrigation saline water, and wastewater infiltration 23 . Hence, understanding the actual mechanism of groundwater salinization and the affected sources is essential for water resources management and sustainability 24,25 .
Worth mentioning, Electrical conductivity (EC) is used to explain the salinity of water; the concentration of dissolved salts is a metric to determine the EC of groundwater 26 . In a thoroughly prepared groundwater sample, EC is generally tested by creating an electric current between the two electrodes of a salinometer. This approach is point-based and analyses the EC of the tested groundwater samples. Although this process is accurate, the preparation step makes it time-consuming to perform over a large area. In the local setting, direct current resistivity methods examine EC distribution. In this method, the potential field is determined using two additional electrodes; a current is created and delivered into the ground by point electrodes 27 . However, this is a slow procedure that cannot be applied on a regional scale. Previous studies on mapping groundwater salinity based on a regional scale using the feasibility of remote sensing data have been conducted remarkably [28][29][30][31] . However, the main concept of using geographical information system data for mapping the salinity is to estimate the salinity for unknown data points using the interpolation technique within a defined range of measured data points. The sensing technique has its merits. It is quick and straightforward to apply; however, it is connected with significant error calculation and is proportional to the square of the distance between data points 32 . Also, it does not consider the sample's distribution in high salinity areas. Hence, introducing new technology, such as computer aid models for solving this complex and significant natural issue, is one of the prioritized research topics in water and geo-science.
The ability to predict groundwater quality is crucial to comprehending its evolution trend 33,34 . This is especially useful for determining groundwater quality in the groundwater depression cone zone. Numerical modeling and statistical prediction methods are available for predicting groundwater quality. However, machine learning (ML) models have been recently reported as a new method that substantially impacts groundwater quality modeling 35,36 . As a known data analysis method, ML models can automate the framework of an analytical model. Artificial intelligence (AI) is based on the idea that machines can learn from data, recognize patterns, and make decisions with minimal human interaction 37,38 . The ability of ML models to model groundwater salinity has been demonstrated via the establishment of a linear or non-linear relationship between water salinity and its control parameters (such as water table, evaporation, and distance to saltwater bodies) and using those relationships for the prediction of water salinity for regions with unavailable data points 39,40 . Various versions of ML models have been reported in the literature, such as artificial neural network (ANN) [41][42][43][44][45] , support vector machine (SVM) [46][47][48] , adaptive neuro-fuzzy inference system (ANFIS) 49,50 , ensemble ML models 38,51,52 , group method of data handling (GMDH) 53 , and Gaussian process scheme 54 . The significant limitations associated with predictive ML models (1) the need for adequate input variables to explain the target data that may not be available everywhere 55,56 , (2) the influence of well excessive pumping 57,58 , (3) the reliability of the learning process of the predictive model where essential hyperparameters are optimized 59,60 , (4) coupled ML models where a pre-processing technique was integrated for data time series decomposition 61,62 . The ML model was adopted based on the inspiration of developing a new hybrid model for the ANFIS model. In groundwater quality modeling, hybrid ANFIS showed a promising result 63,64 . Due to the highly non-linear relationships between input predictors and water quality targets in coastal aquifers, using a scientific-based strategy to determine the optimal candidate input combinations for feeding the ML www.nature.com/scientificreports/ methods is very important. It has received less attention in the previous literature. In most previous research, regardless of the behavior of the data, a certain number of possible input combinations have been examined using the ML methods, and superior results have been presented. However, selecting specific combinations without a scientific basis may increase the uncertainty and decrease the accuracy of the outcomes. This motivated this study focuses on three significant aspects. The aims of the current investigation, novel predictive models, were developed based on the hybridization of the ANFIS model with new nature-inspired optimization algorithms (i.e., Aquila optimization) for groundwater salinity prediction. The second aim is to inspect the highly influential predictors using the newly explored feature selection algorithm, Boruta-Random Forest. The outcomes of the primary model were examined with standalone ANFIS, ANFIS-SMA, ANFIS-ACOR, and Lasso-Reg approaches. Finally, the current research was adopted on a significant case study, "coastal areal of Bangladesh, " where the groundwater salinity is a vital issue for that region. The current research can provide an essential vision for introducing a reliable computer aid model.

Materials and methods
Study area description. With approximately 24,000 km2, the coastal regions are primarily low-lying areas in the southern portion of Bangladesh along the Bay of Bengal (BoB). Diverse geomorphic characteristics of the coastal district include the deep funnel-shaped structures of the BoB's northern landfill. Tidal fluctuation is prevalent in most river systems from the coastal area, causing fluctuations of 2-4 m. Groundwater pollution is caused by a rise in the relative sea level, rapid population increase, a poor drainage system, salinity intrusion, and other factors. Aquaculture and agricultural activities are the primary sources of income in the coastal regions. Due to rising soil salinity, aquaculture, particularly shrimp culture, has expanded while paddy crop production has decreased 65 . The typical coastal climatic system is characterized by a warm and tropical environment dominated by the BoB's southwest monsoonal flow. The average annual precipitation and temperature (June-September) are 2000-2500 mm and 25 °C, respectively. The Bengal Basin's coastal area started in the late Holocene to the Recent Age 66 . The study area formed the basin's deeper portion throughout the Holocene age. The lithology of this area is varied, with coarse-to finegrained sandstone and peat soil combined with silty clay 65 . Each sediment layer containing groundwater comprises coarse-grain sand, fine-grained silt, and clay 67 . The coastal region's hydrogeology comprises unconsolidated Quaternary alluvial sediments that are covered by a thick (3 to 7 m) silty-clay layer. The shallow aquifer depth ranged from 10 to 50 m, with salt concentrations ranging from 1500 mg/L to 2000 mg/L. Rainwater collecting, especially during the monsoon season, is another viable source of freshwater. The people rely on salinity-rich rivers, channels, and fishponds for their water supply 68 . Furthermore, rainwater recharges the shallow aquifer during the dry winter season, which is invaded excessively 69 .
There are three types of aquifers 70 . The upper shallow aquifer is located northwest of the coastal area, with a thickness varying from a few meters to 60 m. Second, the shallow aquifer, which has a thickness ranging from 10 m to more than 100 m 71 , contains saltwater pockets, particularly abundant in coastal and estuarine flooded areas. Third, there is a deep aquifer with a thickness of more than 200 m and various features in the southern portion of the coastal zone.
The water in this region's aquifer is frequently replenished by rainfall, rivers, and channels 72 . The groundwater is significantly depleted during the dry and monsoon seasons and subsequently refilled. Groundwater flow may have aided the saltwater infiltration into the water-bearing strata. The parent rock impacts the water chemistry, and numerous types of minerals found in the aquifer regulate the water quality. According to the available geological data, the aquifers in this area are either unconfined or semi-confined.
Data description and sampling technique. For using machine learning methods, a large dataset is required. The datasets used in this study came from 65 and 73 , and the sampling design and analytical procedures are described below. During the wet season, 539 groundwater samples were collected from three campaigns between 2015 and 2017. Each sample was assigned an ID number, and coordinates were confirmed using a portable GPS device 74 , as shown in Fig. 1. Before collecting the sample from the tube well, the groundwater was pumped for at least 10 min to remove any standing water.
The pumping of the sampling tube well was continued until the pH and electrical conductivity (EC) were both steady. The samples were collected in prewashed high-density polypropylene (HDPP) bottles 66,75 . It is worth noting that each station collected two sets of replicated samples. The samples were collected and filtered using 0.45 m membranes from MF-Millipore™ in the United States. The samples' HDPP bottles were kept at 4 °C in a more excellent box and subsequently sent to the laboratory for further analysis. While EC and pH were measured using portable pH/EC meters (Hanna HI 9811-5).
A field survey was used to measure groundwater depth and salinity during the wet seasons. Ion chromatography with Dionex ICS-90 was used to determine cations (Ca 2+ , Mg 2+ , Na + , K + ) and anions (Cl -, HCO 3 -, NO 3 -, PO 4 2-, SO 4 2-, and F -). Five standard solutions (1, 5, 10, 15, and 20 mg/L) were employed during the calibration process. A conventional laboratory process and quality control checks were used to provide quality assurance. Three replicated samples were obtained simultaneously to ensure that the test results were accurate by crosschecking with a qualified laboratory. The ion charge balance error (ICBE), which was used to determine accuracy, ranged from 2.63 to 8.62 percent, with a mean of 8.24 percent, well within the permissible limit of 10%. Table 1 lists the descriptive statistics of datasets used in the salinity assessment of the multi-aquifers in coastal regions of Bangladesh. As can be seen, the maximum kurtosis values were owing to the PO 4 -2 (mg/l), K + (mg/l), and Ca 2+ (mg/l).  76 . Boruta is an algorithm for feature selection. More precisely, it acts as a wrapper algorithm for Random Forest. This algorithm is named after a monster from Slavic folklore who inhabited pine trees. The stage of feature selection is critical in predictive modeling. This strategy is vital when a data set including several variables is provided for model construction. This is particularly true when the goal is to understand the mechanics behind the interest variable rather than merely build a high-prediction-accuracy black-box model. Boruta determines the Z-scores for each input predictor concerning the shadow property. The distribution of Z-score metrics reveals the essential characteristics of the predictors 77 . A minimal-optimal feature selection technique was used by ranking and residuals based on the Boruta-determined relevance criteria, followed by stepwise model development 78 .
1. To begin, it randomizes the input data set by making scrambled duplicates of all features (shadow features).  where OOB denotes out-of-bag (i.e., the prediction error for each of the training trials aggregated by bootstrap), whereas (y t = f (x t )) and (y t = f (x t )) denote the predicted values before and after permutation, respectively. Additionally, I() denotes the indicator function. 3. Each iteration determines if a genuine feature is more essential than the best of its shadow features (i.e., whether the feature has a higher Z score than the shadow features' maximum Z score) and continually eliminates features thought to be very irrelevant. The Z-score is computed as follows: where std is the standard deviation of accuracy losses, and then the maximum Z-score for duplicate attributes was computed (MZSA). 4. If Z-scores are less than MZSA, the inputs are tagged "unimportant" and separated permanently until inputs with Z-scores more than MZSA are designated "Confirmed". 5. Finally, the method terminates when all features have been validated or rejected or the required number of random forest iterations reached.
Lasso regression. Robert Tibshirani coined the term LASSO in 1996 79 . It is a robust approach that accomplishes two primary tasks: regularization and feature selection. The Lasso approach constrains certain of the model parameters' absolute values. The total must be less than a preset value (upper bound). To do this, the approach employs a shrinkage (or regularization) procedure in which it penalizes the coefficients of regression variables, thereby shrinking them to zero 80 . Incorporating a penalty item into linear regression may dramatically reduce the variance of a model by effectively shrinking the coefficient estimates, particularly in models with high-dimension predictors 81 . The optimized objective function of Lasso Regression (Lasso-Reg) is as follows: where β 0 denotes the Lasso-Reg shift and β j denotes the x ij coefficients. In this relation, Ŵ is a regulation parameter, which means that if its value is equal to zero, the model becomes a normal regression, and all variables will be present, and if its value increases, the number of independent variables in the model will decrease. The determination of the value for this parameter is usually done by the cross-validation method 80 .
Adaptive neuro-fuzzy inference system (ANFIS). The ANFIS technique is based on a knowledgebased mix of fuzzy inference systems (FIS) and artificial neural networks (ANN). The FIS can generate IF-THEN fuzzy rules from fuzzy sets with an adequate membership function (MF) to represent human thought, but its capabilities are restricted to adaptive learning 82 . While ANNs are capable of adaptive learning for decision-making, they cannot explain how the choice was formed. Thus, incorporating adaptive learning capabilities from ANNs into the IF-THEN fuzzy rules in FIS structures becomes more powerful and may be utilized to tackle complicated engineering or non-engineering issues in various applications 83 .
The ANFIS model is used in this work due to its high capacity for learning and superior performance 84 . The ANFIS model is structured in two parts: antecedent and consequent. To keep things simple, the ANFIS structure is configured with two inputs, x, and y, as seen in Fig. 2. The ANFIS model is composed of five levels structurally. Each level has a distinct role, detailed below 85 . (1) The most significant notion in ANFIS is determining the number of membership functions. This may be regarded as a clustering problem; consequently, the FCM is employed to create a limited number of fuzzy rules.
Bezdek invented the FCM in 1984 86 . Each data point in the FCM method belongs to one of the clusters with a membership value that varies from zero to one. The FCM may be obtained by optimizing the objective function below 87 .
where u ij is membership degree, c is the total number of clusters, m is a constant value, �x j − v i � is Euclidean distance of x j from i th cluster center v i . In the FCM technique, the cluster center and membership degree may be computed using the following Eqs. 87 : Aquila optimizer (AO). Aquila Optimizer (AO) is a novel nature-inspired algorithm proposed by Abualigah et al. in 88 . The following subsections explain how the AO models these processes.
AO simulates Aquila's hunting behavior by demonstrating the actions taken at each hunt stage 89 . The AO algorithm's optimization processes are divided into four categories. The following is a mathematical model of the AO.
Step 1: Expanded exploration. The Aquila accepts the prey area and chooses the best hunting area by high soar with the vertical stoop in the first searching technique (X 1 ). Equation (3) represents this behavior mathematically 90 .
Layer 2 Layer 3 Layer 4 Layer 5 x y www.nature.com/scientificreports/ where X1(t + 1) is the solution created by the first search method for the next iteration of t. (X 1 ). The bestobtained solution at the tth iteration is Xbest(t), representing the prey's approximate location. (1 − t T ) is used to control the number of iterations in the expanded search (exploration). A random value between 0 and 1 is called rand. The current iteration and the maximum number of iterations are represented by t and T, respectively.
where Dim is the problem's dimension size and N is the number of possible solutions in the population.
Step 2: Narrowed exploration. When the prey area is discovered from a high vantage point, the Aquila circles above the target prey prepares the land, and then attacks. Equation (15) represents this behavior mathematically 91 .
where X 2 (t + 1) is the solution created by the second search method (X 2 ) for the next iteration of t. Levy(D) is the levy flight distribution function calculated using Eq. (16). At the ith iteration, XR(t) is a random solution where s is a constant value of 0.01, u, and ν are random integers between 0 and 1, and σ is a value calculated by using Eq. (17).
where β is a constant with a value of 1.5; in Eq. (15), the spiral shape in the search is represented by y and x, which are computed as follows. Where For a fixed number of search cycles, r 1 takes a value between 1 and 20, and U is a small value set to 0.00565. D 1 is an array of integer numbers ranging from 1 to the search space length (Dim), and ω is a small value set to 0.005. Figure 3 depicts the AO's behavior in a spiral shape.
Step 3: Expanded exploitation. When the prey area is precisely defined and the Aquila is ready to land and attack, the third technique (X 3 ) is used. Equation (20) represents this behavior mathematically 92 .
where X 3 (t + 1) is the solution of the next iteration of t, which is generated by the third search method (X 3 ). Xbest(t) refers to the approximate location of the prey until ith iteration, and X M (t) denotes to the mean value of the current solution at tth iteration, which is calculated using Eq. (14). "rand" is a random value between 0 and 1. α and δ are the exploitation adjustment parameters fixed in this paper to a small value (please refer to Table 3).
Step 4: Narrowed exploitation. When the Aquila approaches the prey in the fourth technique (X 4 ), the Aquila attacks the prey over land based on their stochastic movements. Equation (21) represents this behavior mathematically 89 .
where X4(t + 1) is the solution of the fourth search method's (X4) for the next iteration of t, the quality function (Q F ) is used to balance the search strategies and is calculated using Eq. (21). G 1 refers to various AO motions that are used to track the prey during the elope and are generated using Eq. (21). G 2 shows decreasing values from 2 to 0, which represent the AO's flight slope as it follows the prey during the elope from the first (1) to the last (t) location, as calculated using Eq. (21). The current solution at the tth iteration is X(t). www.nature.com/scientificreports/ The quality function value at the tth iteration is Q F (t), and the random value between 0 and 1 is rand. The current iteration and the maximum number of iterations are represented by t and T, respectively. The levy flight distribution function computed using Equation is Levy(D) (6). The effects of the Q F , G 1 , and G 2 on the AO's behavior are shown in Fig. 3. The Pseudo-code of the AO is detailed in Algorithm 1.  where i ∈ 1, 2, 3, . . . . , n , S(i) is the fitness function of the current solution, DF is the best-obtained fitness value. The − → vb is dertermined as follows: The − → W is determined as follows: where r is a random value in [0, 1] , bF is the best-obtained fitness value, wF is the worst obtained fitness value, SmellIndex is sorted fitness value. Figure 4a shows the impacts of potential positions 95 .

• Wrap food
When the food product is pleased to stretch to a place where the food quantity is weak, the priority of that region decreases, causing researchers to shift their attention to other regions of food availability that are not as significant as the food product. Figure 4b depicts the rule for assessing slime mould fitness values.
The mathematical representation for updating positions is given as follows:  . Although slime mould received a better feed supply, it would still spread organic material for seeking other locations for an upper-class food supply rather than investing all of it in a single area to discover a more reliable supply of nutrition. The SMA algorithm's mechanism is depicted in Algorithm 2. 96 , is a multi-agent approach used to tackle optimization issues. This method is based on observational data of real ants seeking food. Ants are small social insects in colonies and cooperate to ensure the colony's survival. While hunting for food, the ants inspect their surroundings and mark them with pheromones, which the colony's other members can follow. When ants locate a food supply, they attempt to nurture it by transporting it back to the colony via the nearest root 97 . The ACO algorithm uses a discrete structure to determine the answer. The concept of discrete structure in ACO means that each decision variable in the defined interval is divided into a certain number of states. By discretizing the space of variables, there is a limit to the algorithm, reducing accuracy. In this regard, ACO generalization to continuous space was considered. If the decision variable space is assumed to be continuous, the algorithm will move in the R space of real numbers. The ACOR algorithm performs spatial integration in decision variables using a probability density function (PDF). Sosha and Dorigo proposed using a Gaussian function to create such a structure 98 . A one-dimensional Gaussian function cannot produce a maximum of several points, whereas using a Gaussian kernel function, the sum of the weights of several single Gaussian functions, can perform such a task. The following Equation defines the weighted sum of 1-D Gaussian functions:

Ant colony optimization (ACO). Ant colony optimization (ACO), invented by Dorigo
where i is the dimension of the problem, k denotes the total number of best solutions in the solution repository. w l is the weight that each solution receives based on its rank and it can be calculated using following Eq. 98 : r l is rank of solutions. The q parameter (Intensification Factor) affects the minimum and maximum limits of w l . When q is small, the solutions with the highest rankings are highly preferred. Equation 29 is used to compute the elements of the weight vector x. Following that, the sampling process is finished in two steps. The first step is The chosen Gaussian function is sampled in the second phase. This can be accomplished by employing a random number generator capable of producing random numbers based on a parameterized normal distribution. The standard deviation of a normal distribution PDF is σ i l and it is determined using the following Equation: where e denotes the iteration and k denotes the solution's number in the solution archive. ξ (Deviation-Distance Ratio) is a parameter that controls the convergence speed. The algorithm's convergence speed decreases with increasing ξ . Solution S i l has rank l and S i e is the solution in the current iteration 98 . The ACO R method refines and regenerates the solution archive in each iteration by adding m new solutions ( k → k + m ) and then removing the worst m solutions ( k + m → k ) in order to maintain the solution archive's size constant (negative and positive update). As a result of the changes to the solutions recorded in the solution archive, the pheromone for each iteration is increased in optimized paths that do not improve the objective function. Thus, the ACO R algorithm finds the optimal solution. For more simplicity, hereafter, the ACO R is called ACO.

Model development
Feature selection procedure. Here, the salinity of multi-aquifers in coastal regions of Bangladesh was modeled based on ten parameters, as reported in Table 1. As stated in literature, optimal input feature selection is one of the most important steps in developing an efficient predictive model with numerous features. On the other hand, linear regression-based methods such as correlation analysis and the best subset approaches may not correctly capture non-linear interactions between input and target parameters. Therefore, adopting an efficient strategy is inevitable. Recently, various strategies have been proposed that are appropriately able to detect nonlinear aspects between data well. In the current research, the Boruta-Random forest feature selection 102 , was Table 2. Selected scenarios of candidate input components for modeling the salinity of groundwater obtained via Burota-random forest feature selection.

ANFIS
Epoch number 220 Step Size Decrease 0.9 Initial Step Size 0.01 Step www.nature.com/scientificreports/ employed as a tree-based powerful feature selection to optimize the input combination and assess the critical degree of each input feature. The outcomes of the Boruta-Random forest feature selection based on the Z-score criterion are illustrated in Fig. 5, which implies the importance of each feature versus salinity. It can be included that Cl -, Na + , and Mg 2+ features have a greater impact on modeling the salinity than the other parameter. Thus, those features were employed in all candidate input combinations. Also, the PH, K + , Depth, and HCO 3 were stood in the next rank and sequentially added to the significant features as the other candidate input combinations. In addition, the PO 4 2− and SO 4 2− regarding fewer Z-scores than the shadow max criterion were ignored to simulate the salinity of the multi-aquifers. The optimal candidate input combinations for more assessment via the predictive models were reported in Table 2. model providing an optimal setting-parameter. Here, five efficient data-intelligent systems, including standalone ANFIS, OA-ANFIS, SMA-ANFIS, ACO-ANFIS, and Lasso-Reg, were examined to estimate the salinity of multi-aquifer in coastal regions of Bangladesh. The ANFIS and hybrid-ANFIS model were developed via Matlab 2019a, and the Lasso-Reg approach was adopted in Python platform 3.8 through the open-source sci-kit-learn library 103 , on a system with Intel Core (TM)i7-6700 CPU with 3.40 GHz. In order to optimize the most significant setting parameter of Lasso-Reg (i.e., Ŵ ), the great search strategy was adopted in the range of Ŵ ∈ (0, 1). Besides, the standalone ANFIS approach was optimized based on a trial and error procedure by examing the cluster number in the range of (3)(4)(5). In the hybrid ANFIS model, membership functions of ANFIS were optimized using three algorithms (e.i., OA, SMA, and ACO) in popuse of the computational cost reduction and accuracy enhancement. Table 3 lists the optimum ANFIS setting parameters, critical values of the algorithm, and the Lasso-Reg tuning parameter. In this research, first, whole datasets (539 data points) are randomly divided into two subsets of training (80%) and testing (20%). Also, to avoid overfitting, the Monte Carlo approach has been employed with 500 runs, and the average results obtained from the Monte Carlo method have been considered. The workflow of salinity estimating the multi-aquifer in coastal regions of Bangladesh is demonstrated in Fig. 6.

Application results and analysis
The selection of the optimal input variables in modeling complex hydrological processes is very crucial. The present study utilized the Boruta-Random forest (B-RF) feature selection technique to nominate the most significant variables which affect the output and cast to develop the efficient data-intelligent models (standalone & hybrid). After the application of B-RF, four scenarios, i.e., C1 with 3 inputs, C2 with 6 inputs, C3 with 7 inputs, and C4 with 8 inputs (Table 2), were built to estimate groundwater's salinity in Bangladesh through standalone & hybrid machine learning (ML) models. The results of the standalone and hybrid ML/data-intelligent models (i.e., ANFIS, ANFIS-OA, ANFIS-SMA, ANFIS-ACO, and Lasso-Reg) under C1 to C4 scenarios were evaluated based on seven statistical indicators including R, RMSE, RAE, E LM , KGE, I A , and U 95% . Table 4 summaries the values of R, RMSE, RAE, E LM , KGE,  Table 4  The results were also appraised using the spider chart in Fig. 7 to compare the performance of ML models in terms of R, RMSE, RAE, E LM , KGE, I A , and U 95% . These figures also clearly demonstrate the superiority of the hybrid ANFIS-OA model over the other ML models. Thus, the outcomes of the applied ML models in C1, C2, C3, and C4 scenarios are ranked in the following order, i.e., ANFIS-OA > ANFIS-SMA > ANFIS-ACO > Lasso-Reg > ANFIS. Optimal combination-RD (%) Figure 11. Relative devotion to the predictive model in the optimal scenario for diagnostic analysis for the testing stage.

Discussion
Accurate monitoring and prediction of soil salinity are essential for sustainable development, land management, water quality, and agricultural activities, especially in arid and semi-arid regions 106 . Therefore, other criteria to examine the performance of applied hybrid and standalone ML models under different scenarios for predicting groundwater salinity are the control rug and density distribution. As mentioned in Table 4, scenario C1 was considered optimal for groundwater salinity prediction in the study area according to the comparison results. So, Fig. 10 demonstrates the control rug and density distribution histogram of ANFIS, ANFIS-OA, ANFIS-SMA, www.nature.com/scientificreports/ ANFIS-ACO, and Lasso-Reg models corresponding to C1 in the testing stage. These figures also confirm the supremacy of the OA algorithm over the others in groundwater salinity prediction.
Similarly, Fig. 11 shows the relative deviation (RD) of predicted groundwater salinity values by ANFIS, ANFIS-OA, ANFIS-SMA, ANFIS-ACO, and Lasso-Reg models for the C1 scenario during the testing period regarding the observed values. The RD was minimum in for the ANFIS-OA (249%) model than ANFIS (488.6%), ANFIS-SMA (278.4%), ANFIS-ACO (679.1%), and Lasso-Reg (603.3%) models. Because of RD%, the ML models attain ANFIS-OA > ANFIS-SMA > ANFIS > Lasso-Reg > ANFIS-ACO pattern. This RD analysis also supports the OA algorithm's effectiveness in optimizing the ANFIS model's performance in groundwater salinity prediction in the study region. Figure 12 displays the temporal variation of groundwater salinity predicted by the ANFIS, ANFIS-OA, ANFIS-SMA, ANFIS-ACO, and Lasso-Reg models corresponding to the C1 optimal scenario in the testing stage. The predicted values of the salinity of groundwater are distributed or plotted concerning the observed values of groundwater salinity (solid black line). The outcomes of the hybrid ANFIS-OA model (dash-dot green line) have much better matching with experimental groundwater salinity values and designate the dominance of the ANFIS-OA model. Figure 13 shows the spatial pattern of groundwater salinity yielded by ANFIS, ANFIS-OA, ANFIS-SMA, ANFIS-ACO, and Lasso-Reg models under the C1 scenario in the testing phase concerning the observed values. The spatial interpolation was done by combining the training and testing datasets using IDW (inverse distance weighted) method. According to these spatial maps, the difference between the observed (left top figure) and the ANFIS, ANFIS-SMA, ANFIS-ACO, and Lasso-Reg models is high but much similar smooth pattern ANFIS-OA model estimates. Thus, the ANFIS with OA algorithm can distinguish the different ranges of groundwater salinity from the other ML models. Wu et al. 107 predicted and mapped the spatial distribution of soil salinity in central Mesopotamia of Iraq by employing the support vector machine (SVM) and random forest regression (RFR) algorithms. They found that the RFR model provided better estimates than the SVM model and stated that the spatial map of soil salinity prepared by the RFR algorithm outcomes helps maintain the agricultural activities and sustainable development in the study area. In addition, the spatial distribution maps of the present study will help understand the vulnerability of salinity in groundwater on a regional scale and adopt preventive measures according to the level of salinity of groundwater.
Furthermore, to make the current research more concur and impactful, results were compared with the existed studies on soil salinity prediction using ML models in a different part of the world 108-110 . Wang et al. 111 predicted soil salinity in three regions, i.e., Qitai, Kuqa, and Yuta oases of China, by using five ML algorithms, including MARS (multiple adaptive regression splines), CART (classification and regression trees), RF (random forest tree ensembles), SGT (stochastic gradient tree boost), and LASSO (least absolute shrinkage and selection operator). According to the results of the comparison, the SGT algorithm was found most suitable for predicting the soil salinity in three distinct regions. Ma et al. 106 applied XGBoost (extreme gradient boosting), CART, and RF models to predict the soil salinity in the Ogan-Kuqa river oasis in China using remote sensing and topographical observations. They found that the XGBoost model achieved better prediction (R 2 = 0.68, RMSE = 10.56 dS m −1 ) than CART (R 2 = 0.57, RMSE = 12.20 dS m −1 ), and RF (R 2 = 0.63, RMSE = 11.41 dS m −1 ) models. Wang et al. 112 employed three ML algorithms, i.e., SVM, ANN, and RF, to predict the soil salinity in China's KPNR (Kongterik Pasture Nature Reserve).
Results show the better performance of SVM (R 2 = 0.88, RMSE = 4.89 dS m −1 ) model than RF (R 2 = 0.27, RMSE = 10.61 dS m −1 ) and ANN (R 2 = 0.57, RMSE = 8.15 dS m −1 ) models. The literature above also established the supremacy of ML algorithms for predicting the soil salinity in different environmental conditions and strongly supports the outcomes of the present study, and endorses the application of hybrid ML models to predict the salinity of groundwater in the study region. The occurrence of major seawater intrusion in the multi-aquifers in the southwest coastal zone was indicated by extremely high salinity levels. Climate change effects such as sea-level rise, storm surges, and water logging, to the best of our knowledge, also increase salt intrusion (Na-Cl type water) along the coast. As a result, salinity-induced water is projected to move further inland, increasing contamination intensity 113 . The lack of surface freshwater supplies, such as downstream river flow, long dry periods, shrimp farming, and rainfall uncertainty, may cause changes in the coastal hydrogeologic environment, causing instability in groundwater recharge, storage, and flow. Our study presents a novel technique to aid water managers and decision-makers in safeguarding groundwater resources against saline water intrusion. Constructing accurate salinity susceptibility maps can lead to increased groundwater resource management stretegies and environmental sustainability. Comparing the current study's results with the previous similar regional investigation demonstrates that the ANFIS-AO regarding higher accuracy (R = 0.945) resulted in the promising outcomes than the Catboost model (R = 0.916) for predicting the groundwater salinity of multi-layer aquifers of Mekong Delta, Vietnam 114 and extreme gradient boosting (EGB) model (R = 0.943) for estimating the groundwater salinity in the southern coastal aquifer of the Caspian Sea, Iran 38 .

Conclusion and future direction
In this research, a novel Nature-inspired ANFIS model (i.e., ANFIS-AO) along with ANFIS-SMA, ANFIS-ACOR, individual ANFIS, and Lasso-Reg coastal multi-aquifers in some regions of Bangladesh based on ten water quality indices promised of depth, pH, Ca 2+ (mg/l), Mg 2+ (mg/l), Na + (mg/l), K + (mg/l), HCO 3 − (mg/l), SO 4 2− (mg/l), PO 4 2− (mg/l), Cl − (mg/l). In the pre-processing stage, the training dataset was explored using the B-RF feature selection, indicating each feature's importance degree in modeling salinity. The outcomes of preprocessing ascertained that SO 4 2− (mg/l) and were neglected as the candidate inputs and Cl − (mg/l), Na + , and Mg 2+ (mg/l) were indicated as the most significant features. Based on the mentioned feature selecting process, four input combinations were examined to assess the compatibility of the predictive ML-based models. A careful Scientific Reports | (2022) 12:11165 | https://doi.org/10.1038/s41598-022-15104-x www.nature.com/scientificreports/ review of the statistical criteria and graphical analysis of the employed ML models shows that the best results are related to the C1, C4, C2, and C3, respectively (C1 > C4 > C2, C3). The hybrid ANFIS-OA approach regarding the most promising metrics (R = 0.9450, RMSE = 1.1253 ppt, KGE = 0.9146, and U 95% = 3.0632) in the testing phase was superior to the ANFIS-SMA (R = 0.9406, RMSE = 1.1534 ppm, KGE = 0.8793, and U 95% = 3.1632), ANFIS-ACOR (R = 0.9402, RMSE = 1.1388 ppm, KGE = 0.8653, and U 95% = 3.1822), Lasso-Reg (R = 0.9358, RMSE = 1.1863 ppm,and KGE = 0.8552), and ANFIS (R = 0.9306, RMSE = 1.2139 ppm, and KGE = 0.9222) models. Besides, the diagnostic assessment of the superior candidate input combination (C1) ascertained that the ANFIS-OA concerning the least RD value (249%) attained the most reliable predicted salinity in coastal multiaquifer followed by the ANFIS-SMA (278.4%), ANFIS (488.6%), ANFIS-ACO (679.1%), and Lasso-Reg (603.3%), respectively. Finally, a comparison of the spatial distribution of salinity in the aquifers of the coastal areas of Bangladesh shows well that the ANFIS-OA method has the best agreement with the actual salinity distribution, while the ANFIS-SMA method is in second place. It is worth noting that the IDW method was employed to interpolate the data points of each predictive model to provide the spatial pattern contours. Since the frameworks presented in this research are based on a robust pre-processing aim to identify the most influential input combinations, the degree of uncertainty of the models will be shallow. Besides, the hybrid neuro-fuzzy models with minor limitations can be used for other water quality indices even in other areas of study. We added this augmentation into the discussion section. The current research was adopted to develop a hybrid version of ML models, and the research findings were successfully approached. Future research direction can be established on different aspects. For instance, studying the data, models, and input parameters uncertainties, investigating the seawater intrusion and its effects on the salinity concentration, and identifying the essential connection between groundwater salinity and corps/ plantations' health and contamination. As another alternative future study, a robust classification method can assess the salinity risks in the entire coastal multi-aquifer system. To lessen and control the deterioration of groundwater quality in the coastal zone, it is critical to avoid leaching salinity intrusion and toxic soil contents into groundwater. The study outcomes can assist the policy-makers and respective agencies in managing and protecting the water resources in the coastal region of Bangladesh. Regionalization of salinity estimation has potential implications for rationally utilizing and developing water resource strategies and plans for reducing vulnerability. The classification methods developed as a future direction can be a possible alternative for salinity assessment in any coastal plain with similar aquifer features and hydrogeologic settings.

Data availability
The data generated or analyzed during this study are available from the corresponding author on reasonable request.